Using Multiple Discriminant Analysis Approach for Linear Text Segmentation
Abstract
Research on linear text segmentation has been an ongoing focus in NLP for the last decade, and it has great potential for a wide range of applications such as document summarization, information retrieval, and text understanding. However, linear text segmentation poses two critical problems: automatic boundary detection and automatic determination of the number of segments in a document. In this paper, we propose a new domain-independent statistical model for linear text segmentation. In our model, a Multiple Discriminant Analysis (MDA) criterion function is used to achieve global optimization in finding the best segmentation, that is, the one with the largest word similarity within segments and the smallest word similarity between segments. To alleviate the high computational complexity introduced by the model, genetic algorithms (GAs) are used. Comparative experimental results show that our method based on MDA criterion functions achieves a higher Pk measure (Beeferman) than the baseline system using the TextTiling algorithm.
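The idea of favoring high similarity within segments and low similarity between segments can be illustrated with a scatter-based score. The following is a minimal sketch, not the paper's actual criterion function: it assumes each sentence is already represented as a word-count vector, and it scores a candidate segmentation by the ratio of between-segment scatter to within-segment scatter. The function name `scatter_ratio` and the toy vectors are hypothetical, and the genetic-algorithm search over boundaries is not shown.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter_ratio(vectors, boundaries):
    """Score a segmentation (hypothetical helper, assumed criterion).

    vectors:    one word-count vector per sentence, in document order.
    boundaries: sorted sentence indices at which a new segment starts
                (index 0 is implied and must not be listed).
    Returns between-segment scatter divided by within-segment scatter;
    larger values mean tighter, better-separated segments.
    """
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [len(vectors)]
    global_mean = mean(vectors)
    within = 0.0
    between = 0.0
    for s, e in zip(starts, ends):
        segment = vectors[s:e]
        seg_mean = mean(segment)
        # Spread of sentences around their own segment mean.
        within += sum(sq_dist(v, seg_mean) for v in segment)
        # Distance of the segment mean from the document mean, weighted by size.
        between += len(segment) * sq_dist(seg_mean, global_mean)
    # Small constant avoids division by zero for single-sentence segments.
    return between / (within + 1e-9)


# Toy document: two sentences about one topic, then two about another.
sentences = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
good = scatter_ratio(sentences, [2])  # boundary at the topic shift
bad = scatter_ratio(sentences, [1])   # boundary one sentence too early
```

On this toy input the boundary placed at the topic shift scores higher than the misplaced one. A ratio of this kind generally grows as segments are added, so by itself it does not settle how many segments a document should have.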
Similar resources
Experiments in Unconstrained Offline Handwritten Text Recognition
A system for off-line handwritten text recognition is presented. It is characterized by a segmentation-free approach, i.e. whole lines of text are processed by the recognition module. The methods used for pre-processing, feature extraction, and statistical modelling are described, and several experiments on writer-independent, multiple writer, and single writer handwriting recognition tasks are...
Recursive Algorithms for Image Segmentation Based on a Discriminant Criterion
In this study, a new criterion for determining the number of classes into which an image should be segmented is proposed. This criterion is based on discriminant analysis for measuring the separability among the segmented classes of pixels. Based on the new discriminant criterion, two algorithms for recursively segmenting the image into the determined number of classes are proposed. The proposed methods can a...
A Robust and Efficient Motion Segmentation Based on Orthogonal Projection Matrix of Shape Space
A novel algorithm for motion segmentation is proposed. The algorithm uses the fact that the shape of an object with homogeneous motion is represented as a 4-dimensional linear space. Thus motion segmentation is done as the decomposition of the shape space of multiple objects into a set of 4-dimensional subspaces. The decomposition is realized using the discriminant analysis of orthogonal projection matrix...
Text Segmentation with Topic Modeling and Entity Coherence
This paper describes a system which uses entity and topic coherence for improved Text Segmentation (TS) accuracy. First, the Latent Dirichlet Allocation (LDA) algorithm was used to obtain topics for sentences in the document. We then performed entity mapping across a window in order to discover the transition of entities within sentences. We used the information obtained to support our LDA-based bo...
A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling
In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...